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Abstract 

Recent work in theoretical computer science and scientific computing has focused on nearly- 
linear-time algorithms for solving systems of linear equations. While introducing several novel 
theoretical perspectives, this work has yet to lead to practical algorithms. In an effort to bridge 
this gap, we describe in this paper two related results. Our first and main result is a simple 
algorithm to approximate the solution to a set of linear equations defined by a Laplacian (for a 
graph G with n nodes and m < n 2 edges) constraint matrix. The algorithm is a non-recursive 
algorithm; even though it runs in 0(n 2 ■ polylog(n)) time rather than 0(m ■ polylog(n)) time 
(given an oracle for the so-called statistical leverage scores), it is extremely simple; and it can 
be used to compute an approximate solution with a direct solver. In light of this result, our 
second result is a straightforward connection between the concept of graph resistance (which 
has proven useful in recent algorithms for linear equation solvers) and the concept of statistical 
leverage (which has proven useful in numerically-implementable randomized algorithms for 
qv ' large matrix problems and which has a natural data-analytic interpretation). 

o' 

m 

^ ! 1 Introduction 

O. 

The problem of approximating the solution to a set of linear equations defined by a Laplacian 

constraint matrix has been of interest recently due to a series of remarkable papers by Spielman 
and Teng [35l [37J, [36] . (This work builds on ideas originally introduced by Vaidya and developed by 
others [SI [51 [7] g) While introducing several novel theoretical perspectives, this work on "nearly- 
linear-time" algorithms has yet to lead to practical algorithms. In this paper, we describe two 
related results in an effort to bridge this theory-practice gap. 

Our first and main result, to be described in Section [21 is a simple algorithm for computing 
an approximate solution to a set of linear equations defined by a Laplacian constraint matrix. 
The simplicity of the algorithm permits us to identify a simple connection (that to our knowledge 
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1 Briefly, recall that to solve a system of linear equations, Ax — b, one can use either direct methods or iterative 
methods ,26, ;39j. Iterative methods, such as Chebyshev or Conjugate Gradients, compute successively better 
approximations to x by performing successive matrix-vector multiplications. The number of iterations typically 
depends on the condition number k(A) of A, where k(A) = Xmax(A) / \min(A) is the ratio of the extreme (nontrivial) 
eigenvalues of A, via a multiplicative factor of -^/k(A). Preconditioning refers to a class of methods to solve 
B~ Ax = B _1 6, where the preconditioning matrix B is chosen such that k(B~ 1 A) is small and such that it is 
easy to solve for Bz = c. Vaidya introduced the idea of using combinatorial methods to precondition Laplacians of 
graphs with Laplacians of their subgraphs. It is known that if one wants to precondition any symmetric diagonally 
dominant matrix, then it suffices to find a preconditioner for a related Laplacian matrix [7J; and, moreover, 
that preconditioning matrices that arise in many applications can be reduced to the problem of preconditioning 
diagonally dominant matrices [10]. Vaidya's methods have been extended [3 J9j E] , and they were used by Spielman 
and Teng to approximate the solution to diagonally dominant linear systems in time that is "nearly-linear" in the 
number of nonzero entries in their defining matrices |36j . 



has been overlooked) with other recent work in the theory and (numerical and data) application 
of randomized algorithms for matrix problems. Thus, our second result, to be described in 
Section [3l is to identify and discuss the connection between the concept of statistical leverage 
and the concept of graph resistance. The latter concept has a long history in spectral graph 
theory [TB], and recently it has proven useful in algorithms for linear equation solvers [34, 3j. 
The former concept also has a long history, but in statistics and diagnostic data analysis |15j . 
Moreover, recently, it has been demonstrated to be the key structural quantity to understand in 
order to bridge the theory-practice gap between theoretical work on randomized algorithms for 
large matrices and applications (both numerical-implementation and data-analysis applications) 
of this "randomized matrix algorithm" paradigm [28j [27J HH EH 12] ■ 

1.1 Laplacian matrices 

Consider a graph G = (V, E) with n vertices and m weighted, undirected edges. We will assume 
that all the weights are positive. Then, we can construct the so-called Laplacian matrix L £ M. nxn 
of G. Let Wij > denote the weight of the edge joining vertices i and j; clearly Wij = if no such 
edge exists. In the most common definition of the Laplacian matrix L, the off-diagonal entries 
of L (Lij, i / j) are set to —w^, while the diagonal entries La (for all i = 1, . . . , n) are equal to 
the "weighted degree" of vertex i, i.e., La = Y^j=i w ij- By definition, L is a symmetric matrix of 
rank at most n — 1, since the all-ones vector is clearly in the null space of L. 

A somewhat less common definition of the Laplacian matrix follows from the so-called edge- 
incidence matrix of the graph G. Let B £ jj mxra denote the edge-incidence matrix of the undi- 
rected graph G, constructed as follows: each row of B corresponds to an edge of G; assuming 
that an (arbitrarily-oriented) edge of G starts at vertex i and ends at vertex j, the i-ih. entry in 
the corresponding row of B is set to +1, the j-th entry is set to —1, and the remaining entries are 
all set to 0. Thus, B has two non-zero entries per row for a total of 2m non-zero entries. Also, let 
W £ jj mx "* k e a di a g ona l matrix containing the edge weights (in the same order as they appear 
in B). Then, it is well-known that 

L = B T WB. 

The above definition makes it obvious that L is a symmetric positive-semidefinite matrix. 

Note that given a Laplacian matrix L £ M. nxn corresponding to an undirected, weighted graph 
G with m edges and positive edge weights, we can immediately derive B and W. That is, by 
considering the m non-zero entries L^ with i < j, since each such entry corresponds to an edge 
joining vertices i and j of weight Wij = —Lij, we can immediately construct B and W. 

1.2 An overview of the problem 

Given a Laplacian matrix L corresponding to an underlying graph G = (V, E) with n vertices 
and m (positively) weighted, undirected edges, consider the following regression problem which 
was addressed by Spielman and Teng [351 EH ES] ■ 

Problem 1 [Least-squares approximation with Laplacian constraints] Given as input 
a Laplacian matrix L £ R nxn as described above and a target vector b £ W 1 , compute 

arg min \\Lx — b\\ . 

The minimal H.2- n orm solution vector x op t to the above problem is equal to 

x op t = L^b, (1) 

where L^ corresponds to the Moore-P enrose generalized inverse. 



This formulation is a generalization of the standard problem of solving a system of linear equations 
of the form Lx = b, in order to better handle the rank-deficiency of L. We chose this formulation 
since our algorithm will make no assumptions on the rank of L. In addition, this formulation 
will make the comparison with related work on randomized algorithms for matrix problems (see 
Section I33j) immediate. 

In this setting, Spielman and Teng [36] provided a randomized, relative-error approximation 
algorithm for Problem[TJ The running time of their algorithm is O (nnz(^4) log Cl n), where nnz(A) 
represents the number of non-zero elements of the matrix A, or equivalently the number of edges 
in the graph G, and c\ is a small constant. The first step of this algorithm corresponds to 
performing "spectral graph sparsification," thereby keeping a small number of edges from G, and 
thus creating a much sparser Laplacian matrix L. The second step of this algorithm involves 
using this sparse matrix L as an efficient preconditioner to solve Problem Q] approximately. In 
order to achieve high precision, this is done in a recursive manner. 

While [36] is a major theoretical breakthrough, its applicability is currently hindered by its 
sheer complexity. In an effort to bridge the gap between theory and practice, recent work of 
Spielman and Srivastava [34J proposed a much simpler algorithm for the graph sparsification step 
of [36J , by arguing that randomly sampling edges from the graph G with probabilities proportional 
to the so-called effective resistances (see Section 13.11 for definitions) of the edges provides a sparse 
Laplacian matrix L satisfying the desired properties. On the negative side, in order to approximate 
the effective resistances of the edges of G efficiently, the Spielman-Srivastava algorithm performs 
O(logn) calls to the Spielman- Teng solver, severely hindering its applicability [34] . We should 
also note that Batson, Spielman, and Srivastava [3] provided a more expensive algorithm for 
finding even sparser spectral sparsifiers. 

Note that the work of Spielman and Teng also addresses a much broader class of matrices, the 
so-called SDDMq class, which can be reduced to the Laplacian case. This reduction is described 
in detail in [10]. For simplicity of presentation, here we will only focus on Laplacian matrices. 



1.3 Solving systems of linear equations with Laplacian matrices 

Our main result in this paper is a simple algorithm to compute an approximate solution to 
Problem [H As with previous algorithms, the first phase will sparsify the input graph, and the 
second phase will solve the problem on the sparsified graph. Our main algorithm will be described 
in detail in Section [2j Briefly, in the first phase we will compute a nonuniform sampling probability 
distribution that depends on the so-called statistical leverage scores [23|. [27] associated with the 
weighted edge-incidence matrix of the input graph. We will then sample a "small" number of edges 
according to that distribution to construct a sparsified Laplacian matrix L, having O (j log ^) 
non-zero entries. Then, in the second phase we will solve the sparsified problem 



arg mm 

x£R n 



Lx — b 



(2) 



to get the vector x opt = L^b. The resulting vector x opt satisfies (with constant probability) 

1 1 *E opt •EoptUL — ^ ll^optlli ' \ ) 

Recall that the "energy norm" \\x\\l for any vector x £ M. n and any matrix L £ R nxn is equal 
to x 1 Lx. Given the sparsified Laplacian L, this second phase will use the conjugate gradient 
method as a direct solver [39 1 to solve the sparse least-squares problem of eqn. ([2]), and thus it 
will take O ( ^- log j I time. For dense graphs, this matches the running time of the Spielman- 
Teng algorithm, while for sparse graphs the Spielman- Teng algorithm is still faster. 



The question of computing the statistical leverage scores (either exactly or approximately) 
is a subtle one, and it is related to the theory-practice disconnect — both for this problem, as 
well as for other problems to which randomized matrix algorithms have been applied. Thus, we 
will discuss this topic in greater detail in Section [3.31 Briefly, 0(mn 2 ) time certainly suffices to 
compute them with standard methods; theoretically, they can be computed in 0(m log Cl n) time, 
for some small constant ci; and they can be efficiently approximated in the presence of certain 
resource constraints. 

2 An algorithm for solving systems of linear equations 

In this section, we will describe our main algorithm to approximate the minimal 0.2-n.oxm. solution 
vector x op t of the least-squares approximation problem with Laplacian constraint matrix (Prob- 
lem [1]). Then, we will state and prove our main quality-of-approximation theorem and discuss 
the running time of the proposed algorithm. 

2.1 Our main algorithm 

Algorithm [1] takes as input an n x n Laplacian matrix L (corresponding to a graph G with n 
vertices and m positively weighted, undirected edges) and constructs annxji sparsified Laplacian 
matrix L. Finally, it computes the minimal ^-norm solution vector x op t of the sparsified problem 
with a direct solver. 

In more detail, the algorithm first computes the edge incidence matrix B and the correspond- 
ing diagonal weight matrix W, as described in Section [TTT1 Then, it computes a set of probabilities 
Pi,P2, ■ ■ ■ ,Pm such that the i-th edge of the graph, i.e., the i-th row of the edge-incidence matrix 
B and the corresponding weight Wu, will be retained with probability proportional to pi. These 
probabilities satisfy eqn. Q and depend on the so-called statistical leverage scores of the matrix 
W ' 2 B. As we will observe in Section 13.21 these scores are proportional to the effective resis- 
tances of the edges of graph G. The parameter j3 at Step 3 of Algorithm [I] facilitates the use of 
approximate (as opposed to exact) probabilities and will be further discussed below. It is worth 
noting that computing the aforementioned probabilities pi exactly (/3 = 1) necessitates 0(mn 2 ) 
time, which is prohibitive for the proposed application^ 

After setting the sparsity parameter r to an appropriate value that guarantees a relative-error 
approximation to the optimal solution at Step 4, exactly r edges of G are sampled (Step 6) with 
respect to the computed probabilities. The weights of the retained edges are rescaled (Step 6) and 
the induced Laplacian L corresponding to the sparsified graph is formed. Note that L G R nxn 
has at most n + 1r non-zero entries, since its underlying sparsified graph has at most r edges. 
Then, the sparsified problem 

(5) 



arg mm 



Lx 



is solved in order to return the minimal ^-norm solution x op t = Ub. The computational sav- 
ings emerge since the sparsified problem can be solved efficiently using, for example, conjugate- 
gradient-type methods as direct solvers. The running time of such methods with input L and b 
is O (n(n + 2r)), where n + 1r is the number of non-zero entries in L. 



2 Indeed, one goal of this work is to focus further research towards efficient — either provably accurate or 
heuristic — algorithms to approximate these leverage scores in various settings, thereby leading to faster algorithms 
for this and related problems. 



Input: Laplacian matrix L G M nxn , corresponding to a graph G with n vertices and m 
(positively) weighted edges, b G W 1 , and accuracy parameter e G (0, 1). 

Output: x opi G R n . 

1. Compute the edge- incidence matrix B G flj mxn an d the diagonal edge- weight matrix 
w eW mxm ( gee Section KJJ). 

2. Let $ = W 1 / 2 ^ G M mxr \ 

3. Compute a set of probabilities pi (for all i = 1 . . . m) such that Y^JiLxPi = -*- an d 



Pi> — 



(^*)(i) 2 



irr II 2 



(4) 



for some /3 G (0, 1]. (£/$ is an orthogonal basis for the column space of <1> and (U$)/j\ 
is the i-th row of U$.) 

4. Set r = y n log ( f° n J , where Co is the unspecified constant of Theorem [2j 

5. Initialize 5 G R mxr to be an all-zeros matrix. 

6. For i = 1, . . . ,r do 

• Pick zj G 1 . . . m, where Prob (it = i) = pc, 

• S itt = l/y/rp^; 

7. Compute L = {B T W 1 ' 2 S) {S T W 1 ' 2 B) G K nxn . 

8. Return x op t = L<b. 



Algorithm 1: Approximating the minimal ^2-norm solution of least-squares problems with Lapla- 
cian constraint matrices. 

2.2 Approximation accuracy 

The following theorem is our main quality-of-approximation result for Algorithm [TJ 

Theorem 1 Given Laplacian matrix L G M™ x?1 (corresponding to a graph G with n vertices and 
m positively weighted edges) and a vector b G M. n , let x op t G W 1 be the solution vector of eqn. {7]j. 
If x op t ^ R n is the output of Algorithm^ for some choice of the accuracy parameter e G (0,1), 
then, with probability at least 2/3, 

1 1 X pt X pt 1 1 £ — ^ 1 1 Xopt Wl • 

Proof: By definition, \\x op t - x op t\\ L = (x op t - x op t) L (x op t - x op t)- Recall that L = B T WB, 
where B G W mxn and W G M. mxm are the edge- incidence and the diagonal weight matrix respec- 
tively (see Section fl.ip . Also recall that the diagonal entries of W are positive and thus W 1 ' 2 is 



well-defined. Then, 

1 1 X opt %opt\\L 



{x op t ~ Xopt) B T WB (Xopt - Xopt) 



W l / 2 B(, 



opt Xopt, 



T 



W^ 2 B (x pt - Xopt) W^ 2 B (x pt - xopt. 



(6) 



We now use the formulas for Xopt and x op t, namely x op t = L^b (from eqn. (p}) and Xopt = L^b. 
Let $ G R mxn denote the matrix W 1 I 2 B and let the SVD of $ be 



$ = U$E V4 . 
Here U$ G W mx P, £$ G RP X P, and V$ G R nx ^, with p < n being the rank of $. Then, 



(7) 



and thus 
Similarly, 
and 



L = $ T $ = T/$S|VJ, 



xopt = tfb = V^EjVJfc. 



(8) 



.1 ,-; 



J/C-T^tT 



r\t 



opt = (5 J $) ' (5 J $) ' 6 = (5 J C^E$^ ) ' (S 1 U$X*Vi 



rT\W 



b. 



(9) 



Combining eqns. ©, 

H^opt 3J opt || £ 



©, ©, and ©, we get 

u$x*v£ (y^ 2 vgb- {s T u^v£) ] (s T u^v£) ]T b 

V^Vgb - £$ (S T £/$£$) f ( S T £/$£$) fT Vjfc 



(10) 



In the above we used the facts that £/$ and V$ are orthogonal matrices, and (XV T ) = VX* for 
any orthogonal matrix V. We now employ Theorem [2] of the Appendix in order to argue that 
S U§ is a matrix whose singular values are all close to unity. (This theorem is a variant of a 
result of Rudelson and Vershynin |31| that was proven as Theorem 4 in the appendix of |24j.) 
More specifically, since U^U§ = I p , Theorem [2] argues that with our choice of r at Step 4 of 
Algorithm Q] 

B[\\ulSS T U^-I p \\ 2 ] <^. 

L II f-IIZJ ^ 

Markov's inequality now implies that with probability at least 2/3 

\\U^SS T U^-I P \\ 2 <^. 
ii f-nz 2 

Using standard perturbation theory [38], we get that for all i = 1, . . . , p, 

\<Ti {ulss T u„) - 1| = K 2 {s T u$) - 1| < ^ 



(11) 



(12) 



holds with probability at least 2/3. (Here <Ji{X) denotes the z-th singular value of X.) This 
implies that the m x p matrix S U$ has rank p with probability at least 2/3. The remainder of 



the proof will be conditioned on this event holding. Using (S T [/$£$) = S, (S T U$,) (which is 
only true if S U§ has full rank), eqn. f|10[) becomes 



j opt 



x, 



opt\\L 



We now focus on the matrix U = S 1 Uq, G W nxp . Let its SVD be 

sFu* = n = u n z n v£. 

Since the rank of S T U$ is p, it follows that U Q G R mx P, E n G M^, and Vh G 
rewrite eqn. (|13p using the SVD of O: 



(13) 

(14) 
<p . We now 



(15) 



Ikopt - x opt \\ L = W^^V^b - VaT, n 2 VQ'S^ 1 V^b\\ 2 . 

Let Sq = I p + £", for some diagonal error matrix E. Using VqVq = VqVq = I p , eqn. (TT5 
becomes 
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X 



opt\\L 



< 



Y^Vgb -V n (I + E) VgX^VgbWl 

VnEV^VgbWl 

EVgV^VgbWl 



EV ( 



Til 2 



& \\2 



vi x vgb 



p||2 ||v- 1 T/ T ?,ll 2 
tj \\2 11^$ V $ b \\2 



(16) 



We now seek to bound the spectral norm of the diagonal matrix E. Notice that the diagonal 
entries of E satisfy 



I Ei, 



C; 



(O) — 1| = \af (S T U^) - 1| 



Using the bounds of eqn. ([12]) we get 

||£|| 2 = max \a~ 2 (SFU*) - 1 

i=\...p 

af (S T U$) - 1 



max 

i=i. ..p 



< 



Ve/2 



a 2 (SPU*) 



1 " (v^/2) 
The last inequality follows since e < 1. Combining eqns. (|16p and ()1T[) . we get 



I ■*- opt x opt ||^ Ji t II * * II 2 



(17) 



(18) 



To conclude the proof, notice that using <3? = W l ' 2 B and eqns. ([7]) and 

rp 

II"^ P*IIl — X p^ljX pf 

^opt 



we get 



W^flzcpt) flU 1/2 Bx, 



|$x opt || 2 

|y-V T /)ll 2 
I * * lb ' 



(19) 



Combining eqns. (|18p and (|19p concludes the proof of the theorem. 



2.3 Running time 

We now discuss the running time of Algorithm [TJ Steps 1 and 2 are trivial and run in 0(m) 
time. Step 3 necessitates the computation of a probability distribution over the rows of BW 1 ' 2 . 
Theoretically, this step runs (for /3 = 1) in 0(mlog Cl n) time, for some small constant c\, as 
described in [3j. (However, in order to achieve this running time it is necessary to perform 0(log n) 
calls to the Spielman-Teng solver, which essentially renders this computation impractical. Below, 
we will discuss in more detail several issues related to computing these probabilities in other 
ways.) Steps 5, 6, and 7 run in 0(m) time, since B is a matrix with two non-zero elements per 
row, W is a diagonal matrix, and the sampling matrix S simply reduces the number of rows in 
BW 1 ' 2 from m to r. Finally, at the last step, we invoke a direct solver for the sparse least-squares 
problem of eqn. ([2]), which takes O ( — log — J time. Thus, from a theoretical perspective, using 

the fact that rn < n 2 , the running time Algorithm Q] is O I — (log -) (log Cl n] 

3 Connecting graph resistances and statistical leverage scores 

In this section, we will show that the effective resistances of the edges of a graph G with n vertices 
and m positively weighted undirected edges are proportional to the statistical leverage scores of 
the rows of the matrix W l ' 2 B (recall our definitions in Section [l.ip . Although this connection 
is straightforward from technical perspective, it is of considerable interest due to the insights it 
provides. 

3.1 Review of effective resistance and statistical leverage 

We start with the following definition of the effective resistance of an edge of a graph: 

Definition 1 Given G = (V,E), a connected, weighted, undirected graph with n nodes, m edges, 
and corresponding edge weights w e > 0, for all e E E, let 

L = B T WB (20) 

denote the nx n Laplacian matrix of G (see Section \l.l\ for notation). The effective resistances 
R e across all edges e E E are given by the diagonal entries of the matrix 

R = BL ] B T , (21) 

where L^ denotes the Moore-Penrose generalized inverse of L. 

Clearly, from standard matrix algebra, the effective resistances of all the edges of G can be 
computed in 0(n 3 ) time. Moreover, if we let G denote an electrical network, in which each edge 
e E E corresponds to a resistor of resistance l/w e , then the effective resistance R e between two 
vertices can be defined as the potential difference induced between the two vertices when a unit 
of current is injected at one vertex and extracted at the other vertex. Finally, effctive resistances 
have a wide range of applications, including not only theoretical applications such as analyzing 
diffusion processes and random walks on graphs, but also very practical applications such as 
analyzing clustering and community structure in large informatics networks. 

A seemingly-unrelated notion is that of the statistical leverage scores of the rows of a matrix: 



Definition 2 Given anmxn matrix A, with m > n, the statistical leverage scores of the rows of 
A are the m diagonal elements of the projection matrix onto the span of the columns of A. That 
is, if the matrix V a denotes any orthogonal basis for the column space of A, then the diagonal 
elements of the projection matrix Pa onto the span of those columns are given by 

(Pa)u = (UaUDu = || (C^)(i) H2, 
where {Ua)u) denotes the i-th row of the matrix V a- 

Clearly, all the statistical leverage scores can be computed in 0(mn 2 ) time. Note that these 
scores could be defined for any m x n matrix A with m < n. In that case, however, if A is 
not rank-deficient, then all the scores are trivially equal to unity. Importantly, the statistical 
leverage scores have a natural interpretation in terms of "importance" or "influence" or "lever- 
age" of the corresponding constraint/row of A in the overconstrained least squares optimization 
problem min x \\Ax — b\\ 2 - As such, they have been of interest historically in diagnostic regression 
analysis [15]. 

More generally, given a rank parameter k, one can define the statistical leverage scores relative 
to the best rank-k approximation to A to be the m diagonal elements of the projection matrix 
onto the span of the best rank-/c approximation to A. These generalized scores have been used 
recently as importance sampling probabilities to obtain relative-error approximation algorithms 
for regression \12\ I24j , and they were essential for for the extension of these ideas to relative-error 
low-rank matrix approximation |23}I27| problems. Prior work [14, 2j has also used term incoherent 
to refer to the situation when no leverage score is particularly large. 

3.2 A simple lemma 

We now describe a connection between graph resistances and statistical leverage scores. Although 
this connection is not so surprising from a technical perspective — indeed, it is obvious once it is 
pointed out — it is useful for the insights it provides. 

Lemma 1 Let the matrix <£ = W 1 ' 2 B £ ]^ m>< » 1 denote the edge-incidence matrix of a graph G 
rescaled by W 1 ' 2 . The statistical leverage scores associated with <3? are (up to scaling) equal to 
the effective resistances of all edges of a weighted graph G. That is, if ti is the leverage score 
associated with the i-th row of <I>, then ti/wi is the effective resistance of the i-th edge. 

Proof: Consider the matrix 

P = W 1/2 B(B T WB) + B T W 1/2 £ R mxm , 

and notice that P = W 1 ' 2 RW 1 ' 2 is simply a rescaled version of the m x m matrix R = BL + B T , 
whose diagonal entries are exactly equal to the effective resistances of all the edges of G. Since 
$ = W 1 / 2 B, it follows that 

P = $($ T $)+$ T . 

Let C/$ denote an orthogonal matrix spanning the column space of $. Then P = U$U$, from 
which it follows that the diagonal elements of P are equal to 

Pii = (U<5>Uq,)n = ||(£/*)(i)|| 2 • 
This concludes the proof of the lemma. 



3.3 Usefulness of statistical leverage in randomized matrix algorithms 

The connection between statistical leverage and effective resistance is of interest in attempts 
to make nearly-linear-time linear equation solvers more practical. The reason is that statisti- 
cal leverage has proven to be the key structural quantity to understand in order to bridge the 
"theory-practice gap" between theoretical work on randomized algorithms for large matrices, and 
applications (both numerical-implementation and data-analysis applications) of this "randomized 
matrix algorithm" paradigm |28[ [27] [23] [30], [2]. In this section, we review some of the "lessons 
learned," in the hope that they provide insights on how to to bridge the theory-practice gap for 
solving linear equations defined by a Laplacian constraint matrices. 

Recall that much work, including, e.g., our previous work |19[ [20] 121 j . followed that of 
Frieze, Kannan, and Vempala [25], in which columns and/or rows from a matrix A are ran- 
domly sampled according to a probability distribution that depends on the Euclidean norms of 
those columns/rows. In this case, worst-case additive-error guarantees of the form 

\\A - P c , k A\\ F < \\A - A k \\ F + e \\A\\ F (22) 

can be obtained, with high probabilityo Although these algorithms were motivated by resource- 
constrained computational environments, they have several drawbacks with respect to numeri- 
cal applications and data analysis applications more generally. First, worst-case additive-error 
bounds are quite coarse. Second, these algorithms were not immediately-relevant to common 
problems, as they are typically formulated, in scientific computing and numerical linear algebra. 
Third, the insights provided by the sampling probabilities into the data are limited — the proba- 
bilities are often uniform due to data preprocessing, or they may correspond, e.g., simply to the 
degree of a node if the data matrix is derived from a graph. 

Importantly, each of these three problems was solved by the introduction of importance sam- 
pling probabilities that depend on the statistical leverage scoreso 

• First, by using importance sampling probabilities that depend on the leverage scores, it 
was shown |23[ [27] that one could randomly sample a "small" number of columns to obtain 
worst-case relative-error guarantees of the form 

\\A - P c , k A\\ F < (1 + e) \\A - A k \\ F , (23) 

with high probability. 

• Second, algorithms that were comparable to or better than previously-existing algorithms 
were provided for the following two very traditional scientific computing problems: 

— Overconstrained Least Squares. Let A be an m x n matrix A, with m ^> n, 
and consider solving x opt = arguing \\Ax — b\\ 2 - In previous work [22j [231 G3j we 
proposed a simple, sampling-based, algorithm for solving this problem: first, compute 
the statistical leverage scores of the rows of A; then, use these scores to construct an 
importance sampling probability distribution to sample a "small" number of rows of 
A and the corresponding elements of b; and finally, solve the induced, much smaller 
but still overconstrained, regression problem using only those (suitably rescaled) rows 



3 



Here Pc,kA denotes the projection of A on a rank-fc subspace spanned by the columns of C. 
4 Although these probabilities were introduced in |22II23| and were used in solving two very traditional numerical 
linear algebra problems in [2411 12] . the connection with leverage scores wasn't made explicit until [27] . 
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of A and the corresponding elements of b. Strong relative error guarantees for this 
overconstrained_| regression problem were proven with this approach [22, 23J. 

— Column Subset Selection Problem. Let A be an m X n matrix, and let A; be a 

positive integer. Then, pick k columns of A forming an m x k matrix C such that 
the residual \\A — Pc vl|L, where £ = 2 or F denotes the spectral norm or Probenius 
norm, is minimized over all possible (?) choices for the matrix C. Previously [121 111]. 
we developed a two-phase algorithm that uses the nonuniformity structure defined by 
the statsitical leverage scores in an essential way to provide theoretical and empirical 
results for both the spectral and Frobenius norm that were competitive or better than 
previously existing results. 

• Third, the insights into the matrix provided by statistical leverage scores (in both numerical 
and data applications) can be quite refined. The insights are used in very different ways, 
depending on whether one is interested in high-quality numerical implementations or large- 
scale data analysis applications. 

— Numerical Implementation Applications. Here, one wants to provide fast high- 
quality numerical implementations, and one is typically interested in the error parame- 
ter to be bery small, e.g., e ~ 10 -16 . For example, with respect to the overconstrained 
least-squares regression problem, performing an exact computation of the statistical 
leverage scores of the rows of A is no faster than exactly solving the original regres- 
sion problem. Sarlos |33} [2^] addressed this problem by preprocessing the matrix A 
and the vector b with the randomized Hadamard transform of Ailon and Chazelle [1] . 
This preprocessing step made the statistical leverage scores almost uniform — effectively 
"washing out" any nonuniformities defined by the leverage scores, thereby densifying 
the matrix if it was sparse — thus leading to the first randomized, relative-error algo- 
rithm for least-squares problems that runs asymptotically faster than @(mn 2 ) time. 
High-quality implementations of such algorithms have appeared [301 [2] , and they high- 
light the significant practical applicability of this approach. 

— Data Analysis Applications. Here, one may want e ~ 0.1, and one is typically 
interested in obtaining insight with respect to some downstream data analysis goal. 
In such cases, SVD-based methods are often chosen for computational convenience, 
rather than because the statistical assumptions underlying their use are satisfied by 
the data — a fact which means that the leverage scores are often extremely nonuniform 
in a way that correlates strongly with what practitioners know about the data [281 l2?l 
ITT| [13] problems. Thus, far from "washing out" this nonuniformity structure, one is 
interested in identifying and exploiting it. Intuitively, conditioned on being reliable, 
more "outlier-like" data points may be the most important and informative. 

This brings us to the question of how to compute these statistical leverage scores, or equiva- 
lently the effective resistances, which is an issue that gets to the heart of the theory-practice gap. 
Depending on the application and the resource constraints, there are several alternatives: 



5 Note that it is easy to show that similar results hold for the very underconstrained problem. Let A be 
an m x n matrix, with m <C n, and consider the problem of finding the minimum-length solution to x opt — 
argmin^llylx — 6|[a = A + b. Sampling variables or columns from A can be represented by postmultiplying A by 
a n x c (with c > m) column-sampling matrix S to construct the (still underconstrained) least-squares problem: 
x op t = aigmm 7 .\\ASS T x - b\\ 2 = A T (AS) T+ (AS) + b. The second equality follows by inserting P a t = A T A T+ to 
obtain ASS T A T A T+ x — b inside the 1 1 • 1 12 and recalling that A + — A T A T+ A + for the Moore-Penrose pseudoinverse. 
If one randomly samples c = 0((n/e 2 ) log(n/e)) columns according to "column-leverage-score" probabilities, i.e., the 
diagonal elements of the projection matrix onto the row space, then it can be proven that | \x op t — x op t\ 
holds, with high probability. 
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• Compute the scores by calling the Spielman-Teng nearly- linear time solver. This algorithm 
runs in O (nnz(A) log Cl n) time, where nnz(A) represents the number of non-zero elements 
of the matrix A, or equivalently the number of edges in the graph G, and c\ is a small 
constant. This method works for computing the leverage scores of Laplacian matrices; and 
in this case it is, theoretically, the best method. 

• Compute the scores by computing an "exact" basis for the column space of the m x n 
matrix <£ = W 1 ' 2 B. This takes 0(mn 2 ) time and works for general matrices. For Laplacian 
matrices it is clearly expensive, given that the weighted edge-incidence matrix is very sparse. 

• Compute an approximation to the scores based on iterative sampling and volume sampling 
ideas that have been used in relative-error low-rank matrix approximations [1S\ I17|. This 
might be of interest if a pass-efficient model is an appropriate model for data access. 

• Compute an approximation to the scores based on numerical methods to, e.g., compute 
an estimator for the diagonal of a matrix [5j. These numerical methods are particularly 
appropriate for large matrices when matrix-vector products are easy to evaluate; they have 
proven useful in uncertainty quantification [4j; and they draw on the observation that the 
leverage scores, being proportional to the diagonal elements of a projection matrix, have 
a natural interpretation in scientific computing in terms of density matrices and Green's 
functions |32|. 



These alternate approaches are of particular interest since data points with high leverage scores 
often have natural interpretations in terms of processes generating the data matrices [27] . More- 
over, an examination of the details of these methods illustrates that problems are parameterized 
within theoretical computer science in very different ways than they are parameterized in scien- 
tific computing. Finally, an important issue to keep in mind is that in most applications, one 
does not need a uniformly good approximation to all the leverage scores, but instead one needs 
a good approximation only to the "high leverage" data points. 

4 Conclusion 

Several open problems suggest themselves. On the theoretical side: Can one draw on the original 
ideas of Spielman and Teng in order to develop an algorithm with the simplicity of ours and 
with the running time approximation of theirs? Similarly, can we get the 0{n log n) factor, which 
currently is due to the result of Rudelson and Vershynin [31], down to 0(n), even for some 
classes of graphs, thereby obtaining a more immediately practical version of the result of Batson, 
Spielman, and Srivastava [3]? On the more applied side: How rapidly can we approximate (even 
with a one-sided approximation) the statistical leverage scores, either for general mxn matrices A 
and arbitrary rank parameter k, or under some realistic generative model? Similarly, can one use 
the connection between statistical leverage and effective resistance to design improved heuristics, 
given knowledge about the processes generating the data? 

We conclude by noting that the last two questions are of particular interest. Although much of 
the recent work on using Laplacian preconditioners has focused on nearly-linear-time solvers for 
computing "exact" solutions, i.e., with the error parameter e set to machine precision, there are 
many other applications of these ideas. For example, in machine learning, Ravikumar and Lafferty 
used preconditioner approximations for doing approximate inference in probabilistic graphical 
models [29]. This connection should not be surprising, as much of the work on the "randomized 
algorithms for matrices" paradigm has been motivated by large-scale data applications. In many 
of these data analysis applications, however, not only is setting e = 1CT 16 not of interest, doing 
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so would actually lead to "worse" answers than setting it, say, as e = 0.1. If other recent 
applications of the randomized algorithms paradigm are any guide [301 123 E] , then the issues 
that will arise when thinking of e as extremely small and trying to couple newer randomized 
algorithmic methods with traditional numerical methods [30] [2] will be very different than the 
issues that arise in applications where the data are much less well-structured and much-coarser 
e's are of interest [2"7]l6]. 
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Appendix 

Let A G flj mxn De an y matrix. Consider the following algorithm, which is essentially the algorithm 
in page 876 of [23] . This algorithm constructs a matrix C G W nxc consisting of c sampled and 
rescaled columns of A. 



Data : A G R mxn , Pi>0,i€ [n] s.t. £ ie[n] Pi 


= 1, positive integer c < n. 


Result : C G M mxc 




Initialize S G M. mxc to be an all-zero matrix. 




for t = 1, . . . ,c do 






Pick it G [n], where Prob (it = i) = pf, 






Si t t = l/y/cpu; 




end 




Return C = AS; 





Algorithm 2: The Exactly(c) algorithm. 

Next, we state a theorem that provides a bound for the approximation error ||^4^4 — CC L. 
We used this in the proof of our main theorem in Section [2] in order to argue that the singular 
values of the "sampled orthogonal" matrix S U$ are all close to unity. In this form, the theorem 
was proven as Theorem 4 in the Appendix of [24J, but it is a variant of the well-known result of 
Rudelson and Vershynin [31 j . 
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Theorem 2 Let A G M mxn with \\A\\ 2 < 1. Construct C using the Exactly(c) algorithm and 
let the sampling probabilities pi satisfy 

iuwii 2 

Pi > P^TT-J 1 ( 24 ) 

for all i G [n] /or some constant f3 £ (0,1]. Let e G (0,1) be an accuracy parameter, assume 
Cq \\A\\ 2 f > 4/3e 2 , and let 

^r 2 \\A\\ 2 \ /r 2 H4ll 2 \ 

(in 'l F \ , / Ml K 1 F 1 



II^IIF 



c=2P^k« 



\ /3e 2 y \ /3e 2 / ' 
('if ere Co is the unknown constant of Theorem 3.1, p. 8 of JffT)/. ) Then, 

V[\\AA T -CC T \\ 2 ] <e. 

Finally, it is worth noting that the condition Cq ||.A||^ > 4/3e 2 is trivially satisfied for any matrix 
A such that ||^4||^ > 4 assuming cq > 1. 



16 



